
Conversation

@kfirwolfson

@kfirwolfson kfirwolfson commented Sep 9, 2025

[V1][Core] Add a cache hit threshold for requests

Purpose

Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests whose prefill is unlikely to benefit from cached KV in P/D disaggregated deployments.

Edit: an additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work in case of preemption, and the request is sent back to the calling router / sidecar / worker.

What this PR adds

  • Global setting: --global-cache-hit-threshold ([0.0–1.0], default 0.0)
  • Per-request override: cache_hit_threshold ([0.0–1.0]) on incoming ChatCompletionRequest / CompletionRequest requests (validated in the protocol layer).
  • Finish reason: New enum value and string "cache_threshold" exposed via v1 engine API. Requests rejected by this gating return HTTP 200 with finish_reason="cache_threshold" and no output tokens.
  • Config visibility & hashing: Threshold is included in VllmConfig and SchedulerConfig.
  • Bounds & validation: All threshold values are validated to the range [0.0, 1.0]; an illustrative validation sketch follows this list.
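
For illustration, bounds validation of this kind could look roughly as follows. This is a sketch using a pydantic-style constrained field; the class name and field subset are illustrative, not the exact vLLM protocol code.

from typing import Optional
from pydantic import BaseModel, Field

class CompletionRequestSketch(BaseModel):
    # Illustrative subset of a completion request; only the new field matters here.
    prompt: str
    max_tokens: int = 16
    # None means "fall back to the global --global-cache-hit-threshold value".
    cache_hit_threshold: Optional[float] = Field(default=None, ge=0.0, le=1.0)

With such a constraint, out-of-range values like -0.1 or 1.5 are rejected at request-parsing time, while the boundary values 0.0 and 1.0 are accepted.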

Why

  • Enables a decode-first optimization in P/D disaggregation: when the ratio of computed tokens (local + external) to prompt length is below the threshold, we avoid scheduling low-benefit prefills on decode nodes. This reduces wasted work and remote KV transfers when cache reuse is insufficient (see the sketch below).
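
As a rough sketch of the gating decision (illustrative names only, not the actual scheduler code), with the per-request value taking precedence over the global one:

def should_reject_for_low_cache_hit(
    num_computed_tokens: int,      # local + external KV-cache hits, in tokens
    num_prompt_tokens: int,
    request_threshold: float | None,
    global_threshold: float,
) -> bool:
    """Return True if the request should finish with reason 'cache_threshold'."""
    threshold = request_threshold if request_threshold is not None else global_threshold
    if threshold <= 0.0 or num_prompt_tokens == 0:
        # Gating disabled (default 0.0), or nothing to prefill.
        return False
    hit_ratio = num_computed_tokens / num_prompt_tokens
    return hit_ratio < threshold

For example, 16 cached tokens against a 58-token prompt give a ratio of about 0.28, which is rejected under a 0.33 threshold.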

Backwards compatibility

  • Default is 0.0 → feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan

1) Unit Tests

Unit tests check the scheduler logic, including:

  • the request threshold overriding the global threshold (a simplified sketch of this rule appears after the list)
  • cache hits from the local KV cache, the external KV cache, or both
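
As a simplified, self-contained sketch of the first case (pytest style, using an illustrative helper rather than vLLM's scheduler internals):

# Illustrative precedence rule: a per-request threshold, when present,
# overrides the global one; otherwise the global value applies.
def effective_threshold(request_threshold, global_threshold):
    return request_threshold if request_threshold is not None else global_threshold

def test_request_threshold_overrides_global():
    assert effective_threshold(0.33, 0.8) == 0.33

def test_missing_request_threshold_falls_back_to_global():
    assert effective_threshold(None, 0.8) == 0.8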

2) E2E manual tests

Run vllm serve with the --global-cache-hit-threshold 0.8 argument to set a default value. We'll override it in most requests.

vllm serve <model_path> --served-model-name "Llama-3.1-8B-Instruct" --global-cache-hit-threshold 0.8

The scheduler computes hit_ratio = computed_tokens / prompt_tokens.

We will send 4 requests. Note that the order matters, as the first request fills the cache the others depend on:

  • Request 1 is sent with cache_hit_threshold: 0, so it is guaranteed to execute and populate the KV-cache. It is short (26 tokens) and will be the prefix of the following requests.
  • Requests 2 and 3 are sent with cache_hit_threshold: 0.33:
    • Request 2: long prompt ≈ 58 tokens → ratio 16/58 ≈ 0.28 → rejected, as the ratio is below the threshold
    • Request 3: medium prompt ≈ 40 tokens → ratio 16/40 ≈ 0.4 → normal generation
  • Request 4 is sent without a cache_hit_threshold field, so the global value of 0.8 takes effect: medium prompt ≈ 39 tokens → ratio 16/39 ≈ 0.41 → rejected, as the ratio is below the global threshold

Request 1) Warm the cache

This run uses cache_hit_threshold: 0 so it’s guaranteed to execute and populate the KV-cache for the base segment.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size",
    "max_tokens": 20,
    "cache_hit_threshold": 0
  }'

Request 2) MISS case

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size. Then we continue with many words so that the token length will exceed 16*3 and cache hit rate will be too low to pass the test case threshold",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"
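
For reference, the rejected response body is a normal OpenAI-style completion with empty output text and the new finish reason, roughly of this shape (fields abbreviated, values illustrative; the exact payload may differ):

{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "Llama-3.1-8B-Instruct",
  "choices": [
    { "index": 0, "text": "", "finish_reason": "cache_threshold" }
  ]
}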


Request 3) HIT case

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix but continue with with whatever text tokens we like and keep it medium after all",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: normal generation ("finish_reason" is not "cache_threshold").

Request 4) MISS case using global threshold

This request omits cache_hit_threshold, so the global threshold of 0.8 applies.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix and now continue with different text so the hit rate will be too low",
    "max_tokens": 20
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"

Notes

  • Exact token counts can vary slightly by tokenizer/model; we got the numbers above using Llama-3.1-8B-Instruct. A quick way to check counts for your own prompts is sketched below.
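
The prompt length in tokens can be measured directly with the model's tokenizer; a sketch (assuming the Hugging Face tokenizer for the served checkpoint is available locally; the model ID below is illustrative):

from transformers import AutoTokenizer

# Illustrative checkpoint; use whatever model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
prompt = "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size"
print(len(tokenizer(prompt).input_ids))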

Test Result

E2E Local smoke tests on a single node:

  • Below threshold: responses returned 200 with finish_reason: "cache_threshold" and empty outputs.
    • Validated with debug logs
    • Request threshold:
      • Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
    • Global threshold:
      • Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
  • At/above threshold: normal token generation.
  • Validators rejected out-of-range values and accepted the boundary values 0.0 and 1.0 (not detailed above).

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.

I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.

@github-actions

github-actions bot commented Sep 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3425995 to 7c0485e on September 9, 2025 16:31
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0b75346 to 8be6b61 on September 14, 2025 05:58
@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat self tag

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 8be6b61 to 0400566 on September 30, 2025 10:24
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 4d756b7 to 0c15acc on September 30, 2025 12:59
@mergify

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 3, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c15acc to 0c9cb3f on October 6, 2025 06:06
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c9cb3f to c087238 on October 6, 2025 06:38
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch 2 times, most recently from 06abf34 to eeae693 on October 6, 2025 22:53
@kfirwolfson
Author

(also added to the PR description above)

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work in case of preemption, and the request is sent back to the calling router / sidecar / worker.

@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 908c9fd to e391053 on October 16, 2025 09:37
@mergify mergify bot added ci/build and removed needs-rebase labels Oct 16, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch 3 times, most recently from b2553c7 to 97698ba on October 16, 2025 10:20
@markmc
Member

markmc commented Oct 16, 2025

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up.

xref #26813 - a proposal to add a policy that if a request fails because remote KV can't be loaded, we just abort the request rather than falling back to doing the prefill work in the decode instance

@kfirwolfson
Author

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up.

xref #26813 - a proposal to add a policy that if a request fails because remote KV can't be loaded, we just abort the request rather than falling back to doing the prefill work in the decode instance

Good catch, @markmc. I'll comment there - we can possibly join forces. Are you reviewing this PR as well?

@elevran

elevran commented Oct 19, 2025

@kfirwolfson
nit: might it be clearer to the caller if the return code is not 200/success?
As currently defined, it requires inspecting the payload before the request is rerouted/retried elsewhere. Error codes 429 (Too Many Requests) or 503 (Service Unavailable) along with a custom header might be friendlier to the routing layer / sidecar.

@kfirwolfson
Author

kfirwolfson commented Oct 19, 2025

@elevran it's a good question what code to return. In the preemption use-case we can gain from a 200 response by attaching the output tokens. Please see the "Optimization (phase 2)" section under RFC #24256. The alternative mentioned there is 422.

@orozery
Contributor

orozery commented Oct 27, 2025

@kfirwolfson looks very good to me! Thanks!

Kfir Wolfson added 3 commits November 3, 2025 17:43
Fix Gemini CR comments
Add unit tests
Move from SamplingParams to request
unit test remake
fix static code analysis rejects
Fix unit test
fix after local CR
fix pre-commit reject
add threshold to request logger and fix some calls to encode
fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 97698ba to da39332 on November 4, 2025 07:28